Bioinformatics is an interdisciplinary field that develops methods and
        software tools for understanding biological data. —
        Wikipedia
      
    
    
      A curated list of awesome Bioinformatics software, resources, and
      libraries. Mostly command line based, and free or open-source. Please feel
      free to contribute!
    
    
    
    
    
    
    Package suites
    
      Package suites gather software packages and installation tools for
      specific languages or platforms. We have some for bioinformatics software.
    
    
      - 
        
          Bioconductor
          - A plethora of tools for analysis and comprehension of
          high-throughput genomic data, including 1500+ software packages. [
          paper-2004
          | web ]
        
       
      - 
        
          Biopython
          - Freely available tools for biological computing in Python, with
          included cookbook, packaging and thorough documentation. Part of the
          Open Bioinformatics Foundation.
          Contains the very useful
          Entrez
          package for API access to the NCBI databases. [
          paper-2009 |
          web ]
        
       
      - 
        
          Bioconda -
          A channel for the
          conda package manager
          specializing in bioinformatics software. Includes a repository with
          3000+ ready-to-install (with conda install)
          bioinformatics packages. [
          paper-2018 |
          web ]
        
       
      - 
        BioJulia -
        Bioinformatics and computational biology infrastructure for the Julia
        programming language. [ web ]
      
 
      - 
        Rust-Bio
        - Rust implementations of algorithms and data structures useful for
        bioinformatics. [
        paper-2016
        ]
      
 
      - 
        SeqAn -
        The modern C++ library for sequence analysis.
      
 
      - 
        (Poly)merase
        - A Go library and command line utility for engineering organisms.
      
 
      - 
        
          Biocaml
          - Biocaml aims to be a high-performance user-friendly library for
          Bioinformatics.
        
       
    
    
    
      - 
        GGD
        - Go Get Data; A command line interface for obtaining genomic data. [
        web ]
      
 
      - 
        SRA-Explorer
        - Easily get SRA download links and other information. [
        web ]
      
 
    
    Data Processing
    Command Line Utilities
    
      - 
        Bioinformatics One Liners
        - Git repo of useful single line commands.
      
 
      - 
        BioNode
        - Modular and universal bioinformatics, Bionode provides pipeable UNIX
        command line tools and JavaScript APIs for bioinformatics analysis
        workflows. [ web ]
      
 
      - 
        bioSyntax
        - Syntax Highlighting for Computational Biology file formats (SAM, VCF,
        GTF, FASTA, PDB, etc…) in vim/less/gedit/sublime. [
        paper-2018 |
        web ]
      
 
      - 
        CSVKit
        - Utilities for working with CSV/Tab-delimited files. [
        web ]
      
 
      - 
        csvtk
        - Another cross-platform, efficient, practical and pretty CSV/TSV
        toolkit. [ web ]
      
 
      - 
        datamash
        - Data transformations and statistics. [
        web ]
      
 
      - 
        easy_qsub
        - Easily submit PBS jobs with a script template. Multiple input files
        supported.
      
 
      - 
        GNU Parallel - General parallelizer that runs jobs in
        parallel on a single multi-core machine.
        Here are some example
        scripts using GNU Parallel. [
        web ]
      
 
      - 
        grabix -
        A wee tool for random access into BGZF files.
      
 
      - 
        gsort -
        Sort genomic files according to a specified order.
      
 
      - 
        tabix -
        Table file index. [
        paper-2011 ]
      
 
      - 
        wormtable
        - Write-once-read-many table for large datasets.
      
 
      - 
        zindex
        - Create an index on a compressed text file.
      
 
    
    Next Generation Sequencing
    Workflow Managers
    
      - 
        BigDataScript
        - A cross-system scripting language for working with big data pipelines
        in computer systems of different sizes and capabilities. [
        paper-2014 |
        web ]
      
 
      - 
        Bpipe -
        A small language for defining pipeline stages and linking them together
        to make pipelines. [ web ]
      
 
      - 
        Common Workflow Language
        - A specification for describing analysis workflows and tools that are
        portable and scalable across a variety of software and hardware
        environments, from workstations to cluster, cloud, and high performance
        computing (HPC) environments. [
        web ]
      
 
      - 
        Cromwell
        - A Workflow Management System geared towards scientific workflows. [
        web ]
      
 
      - 
        Galaxy -
        A popular open-source, web-based platform for data-intensive biomedical
        research. Has several features, from data analysis to workflow
        management to visualization tools. [
        paper-2018
        | web ]
      
 
      - 
        Nextflow
          (recommended)
        - A fluent DSL modelled around the UNIX pipe concept, that simplifies
        writing parallel and scalable pipelines in a portable manner. [
        paper-2018 |
        web ]
      
 
      - 
        Ruffus
        - Computation Pipeline library for Python widely used in science and
        bioinformatics. [
        paper-2010 |
        web ]
      
 
      - 
        SeqWare
        - Hadoop Oozie-based workflow system focused on genomics data analysis
        in cloud environments. [
        paper-2010 |
        web ]
      
 
      - 
        Snakemake
        - A workflow management system in Python that aims to reduce the
        complexity of creating workflows by providing a fast and comfortable
        execution environment. [
        paper-2018 |
        web ]
      
 
      - 
        Workflow Description Language
        - Workflow standard developed by the Broad Institute. [
        web ]
      
 
    
    Pipelines
    
      - 
        Awesome-Pipeline
        - A list of pipeline resources.
      
 
      - 
        Bactopia
        - A flexible pipeline, built with Nextflow, for the complete analysis of
        bacterial genomes. [ web ]
      
 
      - 
        Bacannot
        - A generic but comprehensive bacterial annotation pipeline, built with
        Nextflow, with nice graphical options for investigating results. [
        web
        ]
      
 
      - 
        bcbio-nextgen
        - Batteries included genomic analysis pipeline for variant and RNA-Seq
        analysis, structural variant calling, annotation, and prediction. [
        web ]
      
 
      - 
        R-Peridot
        - Customizable pipeline for differential expression analysis with an
        intuitive GUI. [
        web ]
      
 
      - 
        ngs-preprocess
        - A pipeline for preprocessing short and long sequencing reads, built
        with Nextflow. [
        web
        ]
      
 
    
    Sequence Processing
    
      Sequence Processing includes tasks such as demultiplexing raw read data,
      and trimming low quality bases.
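
      As a rough illustration of what trimming low quality bases means, the
      sketch below drops trailing bases from a single read once their Phred
      score falls under a cutoff. It assumes Phred+33 encoded qualities, and
      the function names and the default cutoff of 20 are hypothetical choices
      for this example. None of the tools listed below is claimed to work
      exactly this way; most use more refined strategies such as sliding
      windows.

```python
from __future__ import annotations


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q (hypothetical cutoff)."""
    scores = phred_scores(quality)
    end = len(scores)
    # Walk back from the 3' end until a base meets the cutoff.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


if __name__ == "__main__":
    # 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 under Phred+33.
    print(trim_3prime("ACGTACGT", "IIIII##!"))  # ('ACGTA', 'IIIII')
```

      For real data, a dedicated tool from the list below is the better
      choice, since those also handle adapters, paired reads, and compressed
      input.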
    
    
      - 
        AfterQC
        - Automatic Filtering, Trimming, Error Removing and Quality Control for
        fastq data. [
        paper-2017 ]
      
 
      - 
        FastQC
        - A quality control tool for high throughput sequence data. [
        web
        ]
      
 
      - 
        Fastqp -
        FASTQ and SAM quality control using Python.
      
 
      - 
        Fastx Toolkit
        - FASTQ/A short-reads pre-processing tools: Demultiplexing, trimming,
        clipping, quality filtering, and masking utilities. [
        web ]
      
 
      - 
        MultiQC
        - Aggregate results from bioinformatics analyses across many samples
        into a single report. [
        paper-2016 |
        web ]
      
 
      - 
        SeqFu -
        Sequence manipulation toolkit for FASTA/FASTQ files written in Nim. [
        paper-2021 |
        web ]
      
 
      - 
        SeqKit
        - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
        in Golang. [
        paper-2016 |
        web ]
      
 
      - 
        seqmagick
        - Convenient file format conversion built on Biopython. [
        web ]
      
 
      - 
        Seqtk -
        Toolkit for processing sequences in FASTA/Q formats.
      
 
      - 
        smof
        - UNIX-style FASTA manipulation tools.
      
 
    
    Data Analysis
    
      The following items allow for scalable genomic analysis by introducing
      specialized databases.
    
    
      - 
        Hail -
        Scalable genomic analysis.
      
 
      - 
        GLNexus
        - Scalable gVCF merging and joint variant calling for population
        sequencing projects. [
        paper-2018
        ]
      
 
    
    Sequence Alignment
    Pairwise
    
      - 
        Bowtie 2
        - An ultrafast and memory-efficient tool for aligning sequencing reads
        to long reference sequences. [
        paper-2012 |
        web ]
      
 
      - 
        BWA -
        Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.
      
 
      - 
        WFA - The
        wavefront alignment algorithm (WFA), which exploits sequence similarity to
        speed up alignment. [
        paper-2020
        ]
      
 
      - 
        Parasail
        - SIMD C library for global, semi-global, and local pairwise sequence
        alignments [
        paper-2016
        ]
      
 
      - 
        MUMmer
        - A system for rapidly aligning entire genomes, whether in complete or
        draft form. [
        paper-1999 |
        paper-2002 |
        paper-2004 |
        web ]
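
      For readers new to pairwise alignment, here is a minimal sketch of the
      textbook global alignment score (Needleman-Wunsch with a linear gap
      penalty). The scoring values are arbitrary assumptions, and this is not
      how the tools above are implemented: they rely on indexes, SIMD, or
      wavefront techniques to avoid filling the full dynamic programming
      table.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of a and b.

    Uses the classic dynamic programming recurrence with a linear gap
    penalty, keeping only two rows of the table in memory.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: (mis)match, gap in b, gap in a.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))  # 4
    print(global_alignment_score("ACGT", "ACT"))   # 1
```

      The table has one cell per pair of positions, so time grows with the
      product of the two sequence lengths, which is why dedicated aligners
      are needed for reads against whole genomes.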
      
 
    
    Multiple Sequence Alignment
    
      - 
        POA -
        Partial-Order Alignment for fast alignment and consensus of multiple
        homologous sequences. [
        paper-2002
        ]
      
 
    
    Clustering
    
    Quantification
    
      - 
        Cufflinks
        - Cufflinks assembles transcripts, estimates their abundances, and tests
        for differential expression and regulation in RNA-Seq samples. [
        paper-2010 ]
      
 
      - 
        RSEM - A
        software package for estimating gene and isoform expression levels from
        RNA-Seq data. [
        paper-2011
        | web ]
      
 
    
    Variant Calling
    
      - 
        DeepVariant
        - Deep learning-based variant caller [
        paper-2018 ]
      
 
      - 
        freebayes
        - Bayesian haplotype-based polymorphism discovery and genotyping. [
        web ]
      
 
      - 
        GATK -
        Variant Discovery in High-Throughput Sequencing Data. [
        web ]
      
 
      - 
        Octopus
        - A polymorphic Bayesian genotyping model with wide applicability. [
        paper-2021
        ]
      
 
      - 
        
          bcftools
          - samtools/bcftools are a suite of tools for manipulating NGS data and
          can be used to call variants. [
          paper-2009 |
          web ]
        
       
    
    Structural variant callers
    
      - 
        Delly
        - Structural variant discovery by integrated paired-end and split-read
        analysis. [
        paper-2012 ]
      
 
      - 
        lumpy -
        lumpy: a general probabilistic framework for structural variant
        discovery. [
        paper-2014
        ]
      
 
      - 
        manta -
        Structural variant and indel caller for mapped sequencing data. [
        paper-2015 ]
      
 
      - 
        gridss
        - GRIDSS: the Genomic Rearrangement IDentification Software Suite. [
        paper-2017 ]
      
 
      - 
        
          smoove
          - structural variant calling and genotyping with existing tools,
          but, smoothly.
        
       
    
    BAM File Utilities
    
      - 
        Bamtools
        - Collection of tools for working with BAM files. [
        paper-2011
        ]
      
 
      - 
        bam toolbox
        - MtDNA:Nuclear Coverage; BAM Toolbox can output the ratio of
        MtDNA:nuclear coverage, a proxy for mitochondrial content.
      
 
      - 
        mergesam
        - Automate common SAM & BAM conversions.
      
 
      - 
        mosdepth
        - fast BAM/CRAM depth calculation for WGS, exome, or targeted
        sequencing. [
        paper-2017 ]
      
 
      - 
        SAMstat
        - Displaying sequence statistics for next-generation sequencing. [
        paper-2010
        | web ]
      
 
      - 
        Somalier
        - Fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs. [
        paper-2020 ]
      
 
      - 
        Telseq -
        Telseq is a tool for estimating telomere length from whole genome
        sequence data. [
        paper-2014
        ]
      
 
    
    VCF File Utilities
    
    GFF BED File Utilities
    
      - 
        AGAT -
        Suite of tools to handle gene annotations in any GTF/GFF format. [
        web ]
      
 
      - 
        gffutils
        - GFF and GTF file manipulation and interconversion. [
        web ]
      
 
      - 
        BEDOPS
        - The fast, highly scalable and easily-parallelizable genome analysis
        toolkit. [
        paper-2012
        ]
      
 
      - 
        Bedtools2
        - A Swiss Army knife for genome arithmetic. [
        paper-2010 |
        paper-2014 |
        web ]
      
 
    
    Variant Simulation
    
      - 
        Bam Surgeon
        - Tools for adding mutations to existing .bam files, used
        for testing mutation callers. [
        web
        ]
       
      - 
        wgsim -
        Comes with samtools! - Reads simulator. [
        web
        ]
      
 
    
    Variant Prediction/Annotation
    
      - 
        SIFT -
        Predicts whether an amino acid substitution affects protein function. [
        paper-2003 |
        web ]
      
 
      - 
        SnpEff
        - Genetic variant annotation and effect prediction toolbox. [
        paper-2012
        | web ]
      
 
    
    Python Modules
    Data
    
    
    
    Assembly
    
      - 
        SPAdes -
        SPAdes (St. Petersburg genome assembler) is an assembly toolkit
        containing various assembly pipelines and the de-facto standard for
        prokaryotic genome assemblies.
      
 
      - 
        SKESA -
        SKESA is a de-novo sequence read assembler for microbial genomes. It
        uses conservative heuristics and is designed to create breaks at repeat
        regions in the genome. This leads to excellent sequence quality without
        significantly compromising contiguity.
      
 
    
    Annotation
    
      - 
        Prokka
        - Prokka: rapid prokaryotic genome annotation. Prokka is one of the most
        cited annotation command line tools for microbial genome annotations.
      
 
      - 
        Bakta
        - Bakta is a tool for the rapid & standardized annotation of
        bacterial genomes & plasmids. It provides dbxref-rich and
        sORF-including annotations in machine-readable JSON & bioinformatics
        standard file formats for automatic downstream analysis.
      
 
    
    Long-read sequencing
    Long-read Assembly
    
      - 
        canu - A
        single molecule sequence assembler for genomes large and small.
      
 
      - 
        flye
        - De novo assembler for single molecule sequencing reads using repeat
        graphs.
      
 
      - 
        hifiasm
        - A haplotype-resolved assembler for accurate HiFi reads.
      
 
      - 
        wtdbg2
        - A fuzzy Bruijn graph approach to long noisy reads assembly
      
 
    
    Visualization
    Genome Browsers / Gene Diagrams
    
      The following tools can be used to visualize genomic data or for
      constructing customized visualizations of genomic data including sequence
      data from DNA-Seq, RNA-Seq, and ChIP-Seq, variants, and more.
    
    
      - 
        Squiggle
        - Easy-to-use DNA sequence visualization tool that turns FASTA files
        into browser-based visualizations. [
        paper-2018 |
        web ]
      
 
      - 
        biodalliance
        - Embeddable genome viewer. Integrates data from a wide variety of
        sources, and can load data directly from popular genomics file formats
        including bigWig, BAM, and VCF. [
        paper-2011 |
        web ]
      
 
      - 
        BioJS -
        BioJS is a library of over one hundred JavaScript components enabling you to
        visualize and process data using current web technologies. [
        paper-2014 |
        web ]
      
 
      - 
        Circleator
        - Flexible circular visualization of genome-associated data with BioPerl
        and SVG. [
        paper-2014 ]
      
 
      - 
        DNAism -
        Horizon chart D3-based JavaScript library for DNA data. [
        paper-2016
        | web ]
      
 
      - 
        IGV js -
        JavaScript-based browser. Fast, efficient, scalable visualization tool for
        genomics data and annotations. Handles a large variety of formats. [
        paper-2019 |
        web ]
      
 
      - 
        Island Plot
        - D3 JavaScript based genome viewer. Constructs SVGs. [
        paper-2015 ]
      
 
      - 
        JBrowse -
        JavaScript genome browser that is highly customizable via plugins and
        track customizations. [
        paper-2016 |
        web ]
      
 
      - 
        PHAT -
        Point and click, cross platform suite for analysing and visualizing
        next-generation sequencing datasets. [
        paper-2018 |
        web ]
      
 
      - 
        pileup.js
        - JavaScript library that can be used to generate interactive and highly
        customizable web-based genome browsers. [
        paper-2016 ]
      
 
      - 
        scribl
        - JavaScript library for drawing canvas-based gene diagrams. [
        paper-2012 |
        web ]
      
 
      - 
        Lucid Align - A modern sequence alignment viewer. [
        web ]
      
 
    
    
    
      - 
        Circos
        - Perl package for circular plots, which are well suited for genomic
        rearrangements. [
        paper-2009 |
        web ]
      
 
      - 
        ClicO FS - An interactive web-based service of Circos.
        [ paper-2015 ]
      
 
      - 
        OmicCircos - R package for circular plots for omics
        data. [
        paper-2014 |
        web
        ]
      
 
      - 
        J-Circos - A Java application for doing interactive
        work with circos plots. [
        paper-2014 |
        web
        ]
      
 
      - 
        rCircos
        - R package for circular plots. [
        paper-2013 |
        web
        ]
      
 
      - 
        fujiplot
        - A circos representation of multiple GWAS results. [
        paper-2018
        ]
      
 
    
    Database Access
    
    Resources
    
    
    
    
    Sequencing
    
      - 
        Next-Generation Sequencing Technologies - Elaine Mardis (2014)
        [1:34:35] - Excellent (technical) overview of next-generation and
        third-generation sequencing technologies, along with some applications
        in cancer research.
      
 
      - 
        Annotated bibliography of *Seq assays
        - List of ~100 papers on various sequencing technologies and assays
        ranging from transcription to transposable element discovery.
      
 
      - 
        For all you seq… (PDF)
        (3456x5471) - Massive infographic by Illumina illustrating how many
        sequencing techniques work. Techniques cover protein-protein
        interactions, RNA transcription, RNA-protein interactions, RNA low-level
        detection, RNA modifications, RNA structure, DNA rearrangements and
        markers, DNA low-level detection, epigenetics, and DNA-protein
        interactions. References included.
      
 
    
    RNA-Seq
    
      - 
        Review papers on RNA-seq (Biostars)
        - Includes lots of seminal papers on RNA-seq and analysis methods.
      
 
      - 
        Informatics for RNA-seq: A web resource for analysis on the cloud
        - Educational resource on performing RNA-seq analysis in the cloud using
        Amazon AWS cloud services. Topics include preparing the data,
        preprocessing, differential expression, isoform discovery, data
        visualization, and interpretation.
      
 
      - 
        RNA-seqlopedia - RNA-seqlopedia
        provides an awesome overview of RNA-seq and of the choices necessary to
        carry out a successful RNA-seq experiment.
      
 
      - 
        A survey of best practices for RNA-seq data analysis
        - Gives an awesome roadmap for RNA-seq computational analyses, including
        challenges/obstacles and things to look out for, but also how you might
        integrate RNA-seq data with other data types.
      
 
      - 
        Stories from the Supplement
        [46:39] - Dr. Lior Pachter shares his stories from the supplement for
        well-known RNA-seq analysis software CuffDiff and
        Cufflinks
        and explains some of their methodologies.
      
 
      - 
        List of RNA-seq Bioinformatics Tools
        - Extensive list on Wikipedia of RNA-seq bioinformatics tools, covering
        all parts of an analysis pipeline from quality control and alignment to
        splice analysis and visualization.
      
 
      - 
        RNA-seq Analysis
        -
        [@crazyhottommy](https://github.com/crazyhottommy)’s notes on various steps and
        considerations when doing RNA-seq analysis.
      
 
    
    ChIP-Seq
    
    YouTube Channels and Playlists
    
      - 
        Current Topics in Genome Analysis 2016
        - Excellent series of fourteen lectures given at NIH about current
        topics in genomics ranging from sequence analysis, to sequencing
        technologies, and even more translational topics such as genomic
        medicine.
      
 
      - 
        GenomeTV - “GenomeTV
        is NHGRI’s collection of official video resources from lectures, to news
        documentaries, to full video collections of meetings that tackle the
        research, issues and clinical applications of genomic research.”
      
 
      - 
        Leading Strand
        - Keynote lectures from Cold Spring Harbor Laboratory (CSHL) Meetings.
        More on
        The Leading Strand.
      
 
      - 
        Genomics, Big Data and Medicine Seminar Series
        - “Our seminars are dedicated to the critical intersection of GBM,
        delving into ‘bleeding edge’ technology and approaches that will deeply
        shape the future.”
      
 
      - 
        Rafael Irizarry’s Channel
        - Dr. Rafael Irizarry’s lectures and academic talks on statistics for
        genomics.
      
 
      - 
        NIH VideoCasting and Podcasting
        - “NIH VideoCast broadcasts seminars, conferences and meetings live to a
        world-wide audience over the Internet as a real-time streaming video.”
        Not exclusively genomics and bioinformatics videos, but many great talks
        on domain-specific uses of bioinformatics and genomics.
      
 
    
    Blogs
    
      - 
        ACGT - Dr. Keith Bradnam writes about
        his “thoughts on biology, genomics, and the ongoing threat to humanity
        from the bogus use of bioinformatics acronyms.”
      
 
      - 
        Opiniomics - Dr. Mick Watson
        writes about bioinformatics, genomes, and biology.
      
 
      - 
        Bits of DNA - Dr. Lior
        Pachter writes reviews and commentary on computational biology.
      
 
      - 
        it is NOT junk -
        Dr. Michael Eisen writes “a blog about genomes, DNA, evolution, open
        science, baseball and other important things”
      
 
    
    Miscellaneous
    
    Online networking groups
    
    License